A typology of North Sea oil and gas platforms

Since the commercial exploitation of marine oil and gas reserves began in the middle of the twentieth century, extensive networks of offshore infrastructure have been installed globally. Many of the structures are now nearing the end of their operational lives and will soon require decommissioning, generating renewed interest in their environmental impacts and in the ecological consequences of their removal. However, such work requires selection of a subsample of assets for surveying; censuses of the entire ‘population’ in any given jurisdiction are practically impossible due to their sheer number. It is important, therefore, that the selected sample is sufficiently representative of the population to draw generalized conclusions. Here, a formal clustering methodology, partitioning around medoids, was used to produce a typology of surface-piercing oil and gas platforms in the North Sea. The variables used for clustering were hydrocarbon product, operational state, platform design and material, and substructure weight. Assessing intra-cluster variability identified 13 clusters as the optimum number. The most important distinguishing variable was platform type, isolating floating platforms first, then concrete gravity-based and then fixed steel. Following clustering, a geographic trend was evident, with oil production more prevalent in the north and gas in the south. The typology allows a representative subset of North Sea oil and gas platforms to be selected when designing a survey, or the representativeness of a previously selected subset of platforms to be assessed. This will facilitate the efficient use of the limited funding available for such studies.

The North Sea is currently covered by legislation which states that (with some exceptions) 'the dumping, and leaving wholly or partly in place, of disused offshore installations within the maritime area is prohibited' 31,32, and so decommissioning will involve the complete removal of these installations and associated infrastructure. As well as the large financial costs associated with the break-down and removal of these structures, the decommissioning process may also have significant environmental impacts, through seabed disturbance 33,34, potential contamination risk 33,35,36, and the removal of the habitat and opportunities afforded to the local flora and fauna by the structures' physical presence 37.
This has led to renewed interest in studying the ecological and environmental impacts of these structures. For regulators to make informed decisions about decommissioning options, more information is needed about the roles played by these structures within the North Sea ecosystem, and the potential impacts of their decommissioning and removal. To this end, many environmental studies have looked to describe or investigate the biology and ecology of these systems, with many more currently underway (e.g. INSITE, www.insitenorthsea.org). However, one thing that is rarely considered, or can be limiting to the broad utility of a study's findings, is the selection of a representative sample of structures at which to collect data.
With such a large number of offshore oil and gas assets in the North Sea, it is practically (and, normally, financially) impossible to conduct in-depth sampling at all locations, e.g. sampling at every single surface-piercing platform throughout the region. The environmental interactions and ecological impacts of two different structures may be vastly different, however, due to differences in their physical shape, size and design. As such, it may be impossible to extrapolate the findings of a study conducted on only a small number of platforms of a single type to other platform types and locations, and so the conclusions may not be useful when considering ecosystem-wide management planning.
To enhance the applicability of the findings of such studies, ensuring efficient use of the limited funding available (by eliminating the need to repeat the work for a different type of structure), it is essential that a representative sample of structures is selected. Alternatively, if a subsample has already been selected and surveyed, it is important to understand how representative the subsample is of the wider population, so that any limitations of the conclusions can be acknowledged.
To a) select a representative subsample before data collection or b) assess the representativeness of a previously selected subsample from the population of North Sea oil and gas platforms, a formal typology is required, whereby platforms are classified into clusters based on common characteristics. This will mean that, based on the relevant variables selected on which to base the clustering, variability within clusters is much lower than variability between clusters. The relative split of the population between clusters can then be used to either select a representative sample, or to assess the representativeness of a previously selected sample.

Methods
To create a formal typology, a comprehensive list of the items to be clustered (platforms in this case) is required, along with the corresponding complete dataset of variables on which the clustering will be based. Here, the OSPAR inventory of offshore installations 38 was used for both the list of the 'population' of offshore platforms (n = 552) and the variables of interest to be used for the clustering. The variables selected were: hydrocarbon product, platform type, operational status, and substructure weight. Other variables were considered (e.g. water depth, whether the platform is manned or unmanned, latitude and longitude, and produced water disposal method), but were not included, for reasons given later (see "Discussion").
These variables include both categorical and continuous data, and so it was necessary to select a clustering methodology that performs effectively with mixed datasets. Partitioning around medoids 39 (PAM) has previously been used for clustering with mixed categorical and continuous data in a wide variety of applications, including, for example, identifying the psychological effects of COVID-19 40, clustering fishing vessels into discrete fleets 41, grouping Indonesian districts for priority for intervention to address stunting 42, grouping estuaries by a range of biotic and abiotic factors 43, grouping similar patients presenting with back pain 44, and identifying different fishing tactics from catch composition 45, among others 46-48. Prior to the execution of a clustering algorithm, some measure of the distance between individuals is required, based on the variables selected. Here, a Gower distance matrix was used 49, due to its utility with mixed categorical and continuous data. Gower distance is calculated as an average of the distances between two individuals calculated for each variable being considered. If the variable is continuous, a standardised difference is used (absolute difference divided by the range), and if the variable is categorical, the distance is 1 if the individuals differ, or 0 if they are the same. One drawback of the Gower distance metric is that it is sensitive to outliers and non-normality of continuous variables. Consequently, due to the significant right-skewness of substructure weight, the data from this variable were log-transformed to approximate normality; a log(x + 1) transformation was used due to the presence of zeroes in the data (e.g. from the floating structures).
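The Gower calculation described above can be illustrated with a short Python sketch. This is only an illustration under stated assumptions: the study itself used R, the function name `gower_distance` and the three platform records are hypothetical, and the values are invented rather than taken from the OSPAR inventory.

```python
import math

def gower_distance(a, b, numeric_ranges):
    """Gower distance between two records (dicts) with mixed variable types.

    numeric_ranges maps each continuous variable name to its range
    (max - min) over the whole population; every other key is treated
    as categorical. All variables are weighted equally.
    """
    total = 0.0
    for key in a:
        if key in numeric_ranges:
            rng = numeric_ranges[key]
            # Continuous: absolute difference divided by the range.
            total += abs(a[key] - b[key]) / rng if rng > 0 else 0.0
        else:
            # Categorical: 0 if the values match, 1 if they differ.
            total += 0.0 if a[key] == b[key] else 1.0
    return total / len(a)

# Hypothetical records; the values are illustrative only.
platforms = [
    {"product": "Oil", "type": "Fixed steel", "status": "Operational", "weight": 12000.0},
    {"product": "Gas", "type": "Fixed steel", "status": "Operational", "weight": 900.0},
    {"product": "Oil", "type": "Floating", "status": "Operational", "weight": 0.0},
]

# log(x + 1) transform of the right-skewed weight variable (handles zeroes).
for p in platforms:
    p["log_weight"] = math.log1p(p.pop("weight"))

values = [p["log_weight"] for p in platforms]
ranges = {"log_weight": max(values) - min(values)}

d01 = gower_distance(platforms[0], platforms[1], ranges)
d02 = gower_distance(platforms[0], platforms[2], ranges)
```

In this toy population the first and third records differ in platform type and sit at opposite ends of the log-weight range, so two of the four per-variable distances equal 1 and the remaining two equal 0.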
The PAM algorithm applies the following steps, based on the Gower distance matrix, to assign a population of n individuals to k clusters:
1. Assign k randomly selected individuals as cluster medoids.
2. Assign all remaining n-k individuals to the cluster with the most proximate medoid.
3. Reassign as medoid the individual in each cluster which would yield the lowest average distance for that cluster.
4. If a change is made at step 3, return to step 2.
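The four steps above can be sketched in Python as follows. This is a minimal illustration of the procedure as described in the text, not the implementation in the R 'cluster' package used in the study, whose pam routine uses a more elaborate build-and-swap search. The function name `pam`, the `seed` and `max_iter` parameters, and the toy distance matrix are assumptions made for the example.

```python
import random

def pam(dist, k, seed=0, max_iter=100):
    """Partition n individuals into k clusters from a precomputed distance matrix.

    Returns (medoids, labels), where medoids holds the index of each cluster's
    medoid and labels[i] is the position in medoids of individual i's cluster.
    """
    n = len(dist)
    rng = random.Random(seed)
    medoids = rng.sample(range(n), k)  # Step 1: random initial medoids.
    labels = []
    for _ in range(max_iter):
        # Step 2: assign every individual to its nearest medoid.
        labels = [min(range(k), key=lambda c: dist[i][medoids[c]]) for i in range(n)]
        # Step 3: in each cluster, choose the member with the lowest total
        # (equivalently, average) distance to the members of that cluster.
        new_medoids = []
        for c in range(k):
            members = [i for i in range(n) if labels[i] == c]
            if not members:  # Guard: keep the old medoid if a cluster empties.
                new_medoids.append(medoids[c])
                continue
            new_medoids.append(min(members, key=lambda m: sum(dist[m][j] for j in members)))
        # Step 4: stop once no medoid changes; otherwise repeat from step 2.
        if new_medoids == medoids:
            break
        medoids = new_medoids
    return medoids, labels

# Toy one-dimensional example with two well-separated groups (illustrative only).
points = [0, 1, 2, 50, 51, 52]
dist = [[abs(p - q) for q in points] for p in points]
medoids, labels = pam(dist, k=2)
```

Because the starting medoids are random, a procedure of this kind can settle on different partitions for different seeds on less clear-cut data, so in practice it is usually run from several starting points.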
In order to select the optimum number of clusters, the average silhouette width of the population was calculated when arranged into 2-25 clusters. Silhouette width is a measure of the closeness of each individual to the other members of its assigned cluster, relative to its closeness to the members of the nearest neighbouring cluster:
$$ s(i) = \frac{b(i) - a(i)}{\max\{a(i),\, b(i)\}} $$
where, for individual i, s(i) is the silhouette width, a(i) is the average dissimilarity from other members of i's assigned cluster, and b(i) is the average dissimilarity from the members of the nearest neighbouring cluster, i.e. the minimum average dissimilarity between i and the members of each of the other clusters to which i was not assigned. The algorithm was applied using the 'cluster' package in the R statistical programming language 50.
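As an illustration of the silhouette calculation, the Python sketch below computes the average silhouette width, s(i) = (b(i) - a(i)) / max(a(i), b(i)), for a given assignment of individuals to clusters. It is a sketch under assumptions rather than the study's R code: the function name `average_silhouette_width`, the toy distances and the two candidate labelings are hypothetical, and singleton clusters are given s(i) = 0 by convention.

```python
def average_silhouette_width(dist, labels):
    """Mean silhouette width for a labeling, given a precomputed distance matrix.

    Requires at least two clusters.
    """
    clusters = {}
    for i, c in enumerate(labels):
        clusters.setdefault(c, []).append(i)
    widths = []
    for i in range(len(dist)):
        own = clusters[labels[i]]
        if len(own) == 1:
            widths.append(0.0)  # Convention for singleton clusters.
            continue
        # a(i): average dissimilarity to the other members of i's own cluster.
        a = sum(dist[i][j] for j in own if j != i) / (len(own) - 1)
        # b(i): lowest average dissimilarity to the members of any other cluster.
        b = min(
            sum(dist[i][j] for j in members) / len(members)
            for c, members in clusters.items()
            if c != labels[i]
        )
        widths.append((b - a) / max(a, b) if max(a, b) > 0 else 0.0)
    return sum(widths) / len(widths)

# Toy one-dimensional data with two obvious groups (illustrative only).
points = [0, 1, 2, 50, 51, 52]
dist = [[abs(p - q) for q in points] for p in points]

good = average_silhouette_width(dist, [0, 0, 0, 1, 1, 1])
poor = average_silhouette_width(dist, [0, 1, 0, 1, 0, 1])
```

Values close to 1 indicate compact, well-separated clusters, while values near or below 0 indicate individuals that sit closer to a neighbouring cluster than to their own. When comparing candidate numbers of clusters, the partition with the highest average width is typically preferred.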

Results and discussion
Examining the average silhouette width revealed 13 to be the optimum number of clusters. These clusters, as assigned using the PAM algorithm, can be characterised using their medoids as an exemplar individual from the group (Table 1), similar in interpretation to the median of the group. Using complete-linkage clustering, it is possible to build a dendrogram from the Gower distances between the medoids, to show how the clusters relate to one another (Fig. 1). The most important variable for differentiating clusters was structure type (floating, fixed steel or concrete); the two largest splits separate out first the floating platforms, then the concrete platforms. Examining the spatial distribution of the various clusters, the most obvious spatial trend is a north-south split of oil and gas respectively (Fig. 2).
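The complete-linkage step used to relate the medoids can be sketched in Python as below. The medoid names and distances are hypothetical placeholders, not values from Table 1 or Fig. 1, and the function `complete_linkage` is an illustrative helper rather than the routine used in the study.

```python
def complete_linkage(dist, names):
    """Agglomerative complete-linkage clustering on a small distance matrix.

    Returns the merges in order as (group_a, group_b, height), where height is
    the largest pairwise distance between members of the two merged groups.
    """
    index = {name: i for i, name in enumerate(names)}
    groups = [(name,) for name in names]
    merges = []

    def gap(a, b):
        # Complete linkage: distance between groups is the maximum pairwise distance.
        return max(dist[index[x]][index[y]] for x in a for y in b)

    while len(groups) > 1:
        # Find the pair of groups with the smallest complete-linkage distance.
        height, i, j = min(
            (gap(a, b), i, j)
            for i, a in enumerate(groups)
            for j, b in enumerate(groups)
            if i < j
        )
        merges.append((groups[i], groups[j], height))
        merged = groups[i] + groups[j]
        groups = [g for k, g in enumerate(groups) if k not in (i, j)] + [merged]
    return merges

# Hypothetical medoids and Gower distances (illustrative only).
names = ["Fixed_Oil", "Fixed_Gas", "Floating_Oil"]
dist = [
    [0.00, 0.25, 0.50],
    [0.25, 0.00, 0.75],
    [0.50, 0.75, 0.00],
]
merges = complete_linkage(dist, names)
```

The last merge in the returned list corresponds to the topmost split of the dendrogram; in this toy example that is the floating medoid joining the two fixed steel medoids.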
A formal typology of the oil and gas platforms of the North Sea was created, classifying the 552 individual platforms into 13 clusters. With this typology, and the relative numbers of platforms in each cluster (Table 1), it is possible to select a representative subsample of structures as part of the survey design process for a study which is unable to visit the entire population of platforms. Alternatively, if a subsample has already been selected or sampled, or a survey designer does not have complete freedom to choose which platforms can be surveyed, the representativeness of a sample can be assessed, and so the applicability of the results to the wider population can be highlighted.
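Both uses of the typology can be illustrated with a brief Python sketch: allocating a fixed number of survey sites across clusters in proportion to cluster size, and comparing the cluster make-up of an existing sample with that of the population. The cluster names and sizes below are hypothetical (three clusters summing to 552, rather than the 13 in Table 1), and the largest-remainder rounding rule is one reasonable choice among several, not a method taken from the study.

```python
from collections import Counter

def proportional_allocation(cluster_sizes, sample_size):
    """Share sample_size across clusters in proportion to size (largest remainder)."""
    total = sum(cluster_sizes.values())
    quotas = {c: sample_size * n / total for c, n in cluster_sizes.items()}
    alloc = {c: int(q) for c, q in quotas.items()}  # Round each quota down.
    leftover = sample_size - sum(alloc.values())
    # Hand the remaining sites to the clusters with the largest fractional parts.
    by_remainder = sorted(quotas, key=lambda c: quotas[c] - alloc[c], reverse=True)
    for c in by_remainder[:leftover]:
        alloc[c] += 1
    return alloc

def representativeness_gap(cluster_sizes, sampled_labels):
    """Sample share minus population share for each cluster."""
    total = sum(cluster_sizes.values())
    counts = Counter(sampled_labels)
    m = len(sampled_labels)
    return {c: counts.get(c, 0) / m - n / total for c, n in cluster_sizes.items()}

# Hypothetical cluster sizes (illustrative only).
sizes = {"A": 300, "B": 200, "C": 52}
plan = proportional_allocation(sizes, sample_size=20)

# An existing sample drawn entirely from cluster A over-represents it.
gap = representativeness_gap(sizes, ["A"] * 20)
```

With small sample sizes, strictly proportional allocation can leave the smallest clusters with no sites at all, so a survey designer may prefer to set a minimum of one site per cluster of interest.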
The variables selected here were relatively basic, dealing only with some aspects of the platforms' physical size and structure, as well as the hydrocarbon product. For each specific application, a set of variables which are likely to be important in the context of the ecological question being asked should be selected, where available. One difficulty in this is that the currently available publicly accessible databases (e.g. the OSPAR and OGA databases) lack information on some important variables, are incomplete in their records of others, and are inaccurate in yet others.
For example, for a study of fish around oil and gas platforms, there are factors relating to substances discharged from the platform which may affect the fish populations below them. These include whether the structures are normally manned or unmanned (and so whether they discharge organic matter in the form of kitchen waste and black- and grey-water), and whether the platform is permitted to discharge produced water (formation water extracted along with the hydrocarbon product and process chemicals) or reinjects it into the reservoir. These data, however, are not included in any public database, and so would require a significant data-mining effort to collect for the entire population, something which was beyond the scope of this study. It may be possible to gather data on these variables for a small selection of platforms (e.g. by contacting the operators directly) and so they could at least be reported as a factor which may affect the ecology of the platforms, even if their comparability with the wider population of unsampled platforms is unknown.
There are also transient variables which can differ temporally at any given platform but may impact the surrounding environment, particularly mobile species which can vary their distribution over short timescales (whereas sessile organisms cannot). For example, activities such as drilling will emit noise and vibration into the surrounding water, but only whilst they are actively occurring. Periods of activity or inactivity vary in duration, but can last up to several months at a time. While these variables might be impossible to include in a typology (due to both their highly transient nature and the amount of data gathering required for their inclusion, as mentioned above), it is essential that they be considered as important contextual information which may bias the data collected at any given time and location.
Some variables have been deliberately omitted following consideration of their relative importance to the 'definition' of each cluster. Water depth, for example, could have been included due to the potential influence it would have on the ecology of the system around a platform 51-56. Additionally, platform location (latitude, longitude, or both) could have been included in the clustering process, as it will affect the ecology of the site 57,58. These variables were omitted, however, because they describe the environment rather than the platform itself. It was decided, therefore, that only information about the platform itself would be used for clustering, and that the environmental variables can be controlled for (or investigated) as part of the survey design or data analysis of the environmental study. For example, it will be important to look at the distribution of water depths in each cluster, post hoc, and ensure that representative samples are selected (in particular in the event of a bimodal distribution in the depth data of a given cluster).
One thing that became apparent over the course of this study is the need for high quality, accurate, publicly accessible databases to be maintained, so that the sort of analysis carried out here can be conducted for future studies using case-appropriate variables for clustering. Much of the information resulting from ecological studies of oil and gas infrastructure may be limited by the number of platforms sampled and a lack of clarity over the representativeness of the subsample selected. The current readily accessible databases, while a useful starting point, are limited in the number of potentially ecologically relevant variables they contain, and there are some issues with the accuracy and maintenance of some of the datasets contained therein (e.g. the location data in the OSPAR inventory of offshore installations contain numerous inaccuracies).

Conclusions
A typology of oil and gas structures in a given study area (here, the North Sea) is essential for selecting a subsample which is suitably representative of the wider 'population'. This will increase the extent to which the conclusions drawn from a study can be generalised, allowing the more efficient use of limited resources available for such studies. The work highlights the need for high quality, accurate databases of information about offshore oil and gas infrastructure to be maintained (including a range of relevant variables) so that a similar typology can be created using any and all characteristics deemed of importance to a new study.

Data availability
The data analysed in this article are freely available online from the OSPAR Data and Information Management System (https://odims.ospar.org/en/).

Figure 2. Map of oil and gas platforms in the North Sea. Symbols denote the cluster to which the platform was assigned during the clustering process, and are coloured by hydrocarbon product. The clusters are designated as structure type_status_product, using the following abbreviations: Fi, Fl and Co for Fixed steel, Floating steel and Concrete gravity base; Op, Cl and Deco for Operational, Closed down and Decommissioned; and Oil, Gas and Con for Oil, Gas and Condensate. The map was generated using the 'maps' and 'mapdata' packages in R (v4.1.2; https://www.r-project.org).